This is a continuation of my MIS 665 Mid-Term project and will be utilized for my Final Project. I will be using what I learned in the second half of the semester. In this project, I will demonstrate my skills in modeling, evaluation and deployment, using regression, classification and clustering.
Austin, N. "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
import statsmodels.api as sm
from statsmodels.formula.api import ols
# model validation
from sklearn.model_selection import train_test_split
#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
# recursive feature selection (feature selection)
from sklearn.feature_selection import RFE
#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import SVG
#from graphviz import Source
from IPython.display import display
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier
#for validating your classification model
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
# feature selection
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2
#pip install scikit-plot (optional)
import scikitplot as skplt
import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward
# load csv file
df_full = pd.read_csv('data/movie_metadata.csv')
df_full.head()
- The data contains 28 variables for 5043 movies and 4906 posters (998MB), spanning 100 years and 66 countries.
- There are 2399 unique director names, and thousands of actors/actresses.
- Source: https://data.world/popculture/imdb-5000-movie-dataset
| Variable Name | Description |
|---|---|
| movie_title | Title of the Movie |
| duration | Duration in minutes |
| director_name | Name of the Director of the Movie |
| director_facebook_likes | Number of likes of the Director on his Facebook Page |
| actor_1_name | Primary actor starring in the movie |
| actor_1_facebook_likes | Number of likes of the Actor_1 on his/her Facebook Page |
| actor_2_name | Other actor starring in the movie |
| actor_2_facebook_likes | Number of likes of the Actor_2 on his/her Facebook Page |
| actor_3_name | Other actor starring in the movie |
| actor_3_facebook_likes | Number of likes of the Actor_3 on his/her Facebook Page |
| num_user_for_reviews | Number of users who gave a review |
| num_critic_for_reviews | Number of critical reviews on imdb |
| num_voted_users | Number of people who voted for the movie |
| cast_total_facebook_likes | Total number of facebook likes of the entire cast of the movie |
| movie_facebook_likes | Number of Facebook likes in the movie page |
| plot_keywords | Keywords describing the movie plot |
| facenumber_in_poster | Number of faces shown in the movie poster |
| color | Film colorization. ‘Black and White’ or ‘Color’ |
| genres | Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’ |
| title_year | The year in which the movie is released (1916:2016) |
| language | English, Arabic, Chinese, French, German, Danish, Italian, Japanese etc |
| country | Country where the movie is produced |
| content_rating | Content rating of the movie |
| aspect_ratio | Aspect ratio the movie was made in |
| movie_imdb_link | IMDB link of the movie |
| gross | Gross earnings of the movie in Dollars |
| budget | Budget of the movie in Dollars |
| imdb_score | IMDB Score of the movie on IMDB |
# Let's look at the columns as seen in the data set.
# We are using this to verify that the data set we pulled is the same as the source data stated it should be.
for col in df_full.columns:
print(col)
# How many records are in the data set?
len(df_full)
# Look at the top 5 records to see what we might find needs cleaned up.
df_full.head()
- Genres appear to be concatenated by | with multiple values. We will need to clean this up.
- We could also clean up the plot_keywords, if we feel this affects the imdb score.
- We have null values in some of the columns.
# Do we have any null values?
df_full.isnull().sum().sort_values(ascending=False)
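As a quick sanity check on how `isnull().sum()` tallies gaps, here is a minimal sketch on a hypothetical two-column frame (the values are made up, not from the movie data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the gross/budget gaps above
toy = pd.DataFrame({"gross": [1.0, np.nan, 3.0, 4.0],
                    "budget": [np.nan, np.nan, 2.0, 5.0]})
# Count nulls per column, most-null column first
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)  # budget 2, gross 1
```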
- It is interesting to see budget and gross have so many null values.
# Let's see how this looks on a bar chart.
df_full.isnull().sum().sort_values(ascending=False).plot(kind='barh',figsize=(8, 8))
plt.xlabel('Total Nulls')
plt.ylabel('Columns')
plt.title(" A display of null values for each column");
- Several columns have null values, we will need to clean these up.
- We might have to focus on the gross, budget, aspect ratio and content rating
- Can we replace any of the other nulls with values, such as the mean or 0?
# Any other data issues?
# How many duplicated rows are in the dataset?
len(df_full[df_full.duplicated() == True])
- 45 rows are duplicates of other rows in the data set.
# What are the data types?
df_full.info()
# Let's describe the numbers in the dataset.
df_full.describe()
# Do we see any correlation issues between the imdb_score and other factors?
df_full.corr()
- Every column has a positive correlation to imdb score except for num user for reviews, and actor 2 facebook likes.
- Number of voted users has the highest positive correlation to imdb score.
plt.figure(figsize=(12,12))
sns.heatmap(df_full.corr(), vmax=.8, square=True, annot=True, fmt=".1f", cmap='Blues')
plt.title("Correlation on all columns");
# What is the break down of director facebook likes?
df_full['director_facebook_likes'].value_counts().sort_values(ascending=False).head().reset_index()
# What is the breakdown of duration?
df_full['duration'].value_counts().sort_values(ascending=False).head().reset_index()
# What is the breakdown of number of voted users?
df_full['num_voted_users'].value_counts().sort_values(ascending=False).head().reset_index()
# What is the breakdown of number of critic reviews?
df_full['num_critic_for_reviews'].value_counts().sort_values(ascending=False).head().reset_index()
# What is the break down of movie facebook likes?
df_full['movie_facebook_likes'].value_counts().sort_values(ascending=False).head().reset_index()
# First let's check out total row count again.
len(df_full)
# Let's remove the duplicate rows, so they don't skew the data.
# Move from the df_full dataset to df
df = df_full.drop_duplicates()
len(df)
- We now have a total of 4998 records after dropping the 45 duplicate rows.
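The `drop_duplicates()` step above can be sketched on a tiny made-up frame (titles and scores are illustrative only):

```python
import pandas as pd

# Made-up rows; the first two are exact duplicates of each other
toy = pd.DataFrame({"movie_title": ["Avatar", "Avatar", "Spectre"],
                    "imdb_score": [7.9, 7.9, 6.8]})
deduped = toy.drop_duplicates()
print(len(toy), len(deduped))  # 3 2
```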
# Earlier we found that we had null values.
# Let's go ahead and drop the nulls, so that we correlate against rows with the most data.
# The following columns had the most nulls:
# gross 884
# budget 492
df = df[df['gross'].notnull()]
df = df[df['budget'].notnull()]
# What is our count after dropping nulls in the gross column?
len(df)
- We are now left with 3857 records in the data set.
# How do our nulls look now?
df.isnull().sum().sort_values(ascending=False).plot(kind='barh',figsize=(8, 8))
plt.xlabel('Total Nulls')
plt.ylabel('Columns')
plt.title(" A display of null values for each column");
- Because the nulls in aspect_ratio and content_rating affect only about .02% of the data set, we will leave them.
- But we need to look to see if we can replace some values, now that we have a good set of data.
# What are our true counts of null data now?
df.isnull().sum().sort_values(ascending=False)
# Let's describe the data that is left to set some values, based on mean()
df.describe()
- We can use this describe to decide which columns we should analyze.
# Replace values, let's go with the mean for all floats.
df = df.fillna({'num_critic_for_reviews': 163.0,
                'duration': 110.0,
                'actor_1_facebook_likes': 7576.0,
                'actor_2_facebook_likes': 1959.0,
                'actor_3_facebook_likes': 747.0,
                'facenumber_in_poster': 1.4,
                'aspect_ratio': 2.1})
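Rather than hard-coding the means, the same fill can be computed straight from the data; a minimal sketch on a hypothetical frame (`toy`, with made-up values):

```python
import numpy as np
import pandas as pd

# Hypothetical columns with gaps
toy = pd.DataFrame({"duration": [100.0, np.nan, 120.0],
                    "aspect_ratio": [1.85, 2.35, np.nan]})
# Fill each numeric column's nulls with that column's own mean
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled["duration"].iloc[1])  # 110.0, the mean of 100 and 120
```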
df.isnull().sum().sort_values(ascending=False)
df.isnull().sum().sort_values(ascending=False).plot(kind='barh',figsize=(8,8))
plt.xlabel('Total Nulls')
plt.ylabel('Columns')
plt.title(" A display of null values for each column");
- Now it appears we have some fairly clean data. Let's see what else we need to do.
# For classification later, let's create a dataset
dfclass = df
dfclass = dfclass.dropna()
dfclass.head()
# We found that genre had multiple results concatenated by |, let's fix this.
# We will do this by using dummy columns to split them but keep them in the dataset.
# -- Pulled from the canvas forum. :)
df = df.join(df.pop('genres').str.get_dummies('|'))
df.head()
- We now have columns for the genres!
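The genre split above relies on `Series.str.get_dummies`; a minimal sketch on a made-up genre column shows what it produces:

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated format
genres = pd.Series(["Action|Adventure", "Comedy", "Action|Comedy"])
dummies = genres.str.get_dummies("|")  # one 0/1 column per genre
print(dummies)
```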
# Let's create a return_on_investment column, to see if this has a high correlation in later analysis.
df['return_on_investment'] = (df['gross']/df['budget'])*100
df.return_on_investment.head().reset_index()
- We can now see the return on investment in one column.
- Let's remove some columns that we don't need for this research.
# Is color important?
df['color'].value_counts().sort_values(ascending=False).head().reset_index()
- Over 96% of the movies are in color, so this won't help us determine correlation. Let's drop color.
df = df.drop('color', axis=1)
# Is language important?
df['language'].value_counts().sort_values(ascending=False).head().reset_index()
- Over 98% of languages are English. This seems like a non-factor. Let's remove language.
df = df.drop('language', axis=1)
# Is facenumber in poster important?
df['facenumber_in_poster'].value_counts().sort_values(ascending=False).head().reset_index()
# Let's count and display movie titles by year.
df['title_year'].hist(bins=150)
plt.xlabel('Movie Title Year')
plt.ylabel('Count of Movies each year')
plt.title("All movies listed by year");
- Should we remove all movies before 1980 since a majority are after 1980?
df = df.loc[df['title_year'] >= 1980]
df.groupby('title_year').size().plot()
plt.xlabel('Movie Title Year')
plt.ylabel('Count of Movies each year')
plt.title("Only movies listed by year after 1980");
- Sweet, now we only have data on or after 1980.
sns.distplot(df.title_year)
plt.xlabel('Movie Title Year')
plt.ylabel('Count of Movies each year')
plt.title("Standard Normalization chart of movies listed by year after 1980");
- This is an interesting way to view the same data.
# Top 20 movies titles
df.groupby('movie_title')['gross'].sum().sort_values(ascending=False).head(20).plot(kind='barh', figsize=(10,10));
plt.xlabel('Gross $million')
plt.ylabel('Movie Title')
plt.title("Top 20 movies based on gross");
- We can see that Avatar was clearly the top earner here in gross.
- Idea to use this correlation borrowed from http://rstudio-pubs-static.s3.amazonaws.com/342210_7c8d57cfdd784cf58dc077d3eb7a2ca3.html#add-columns
# IMDB_SCORE has quite a few scores, so let's break these down into BINS for easier plotting.
# create a new df (note: df_score is a reference to df, so the bins column added below also lands on df, which the plots that follow rely on)
df_score = df
# setting my own values for bins
df_score['imdbscores_bins'] = pd.cut(df['imdb_score'], bins=[0, 2, 4, 6, 8, 10], labels=[1,2,3,4,5])
df_score.head()
- Thank you Dr. Chae for this helpful code. Instead of using multiple imdb scores to make the plots hard to read, we can now see our plots on imdb scores in 5 easy to read bins.
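The binning step can be sketched on a handful of made-up scores; `pd.cut` assigns each value to its (right-inclusive) bin label:

```python
import pandas as pd

scores = pd.Series([1.5, 3.9, 5.0, 7.2, 9.8])  # made-up imdb scores
# Same bin edges and labels as above
bins = pd.cut(scores, bins=[0, 2, 4, 6, 8, 10], labels=[1, 2, 3, 4, 5])
print(list(bins))  # [1, 2, 3, 4, 5]
```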
sns.catplot(x="imdbscores_bins", y="gross", data=df, kind="violin",
            height=7, aspect=2, palette="muted")
plt.xlabel('IMDB Scores')
plt.ylabel('Gross')
plt.title("IMDB Scores based on gross");
- The above violin plots show that the high grossing films were few and far between across all imdb scores, compared to the lower grossing films. Does this mean that there were not that many high grossing films?
sns.lmplot(x="imdbscores_bins", y="movie_facebook_likes", data=df, order=2)
plt.xlabel('IMDB Scores')
plt.ylabel('Movie Facebook Likes')
plt.title("Facebook likes based on IMDB Scores");
- It appears that the imdb score of around 4 received the most movie facebook likes.
sns.lmplot(x="imdbscores_bins", y="duration", data=df, order=2)
plt.xlabel('IMDB Scores')
plt.ylabel('Duration')
plt.title("IMDB Scores based on Duration");
- It is interesting to see that duration, like facebook likes, is oddly correlated with the imdb score bins.
# violin plot
sns.catplot(x="imdbscores_bins", y="duration", data=df, kind="violin",
            height=8, aspect=2, palette="muted")
plt.xlabel('IMDB Scores')
plt.ylabel('Duration')
plt.title("Duration by IMDB Scores");
- This shows me that the duration is pretty average regardless of what the imdb score is.
plt.scatter(df['imdb_score'], df['movie_facebook_likes'])
plt.xlabel('IMDB Score')
plt.ylabel('Movie Facebook Likes')
plt.title("IMDB Score in relation to movie facebook likes");
- The higher the imdb score, clearly the more movie facebook likes. If you can get a high score, your fans will be happy. Try to stay in the 7-8 range.
# Earlier we performed correlation to see what fields we wanted to use for value_counts.
# Let's work to research correlation between several fields.
# We can now take a look at correlation on our imdb_scores and genres
df_corr = df_full[['imdb_score','genres']]
df_corr.head()
# Now let's split the genres by category
# -- Pulled from the canvas forum. :)
df_genres = df_corr.join(df_corr.pop('genres').str.get_dummies('|'))
df_genres.head()
# Now let's do correlation on these fewer columns
df_genres.corr()
- Of the following genres the following have positive correlation to imdb score.
- Animation - Biography - Crime - Documentary - Drama - Film-Noir - History - Musical - Mystery - News - Romance - Sport - War - Western
# Let's take a look at doing a heat map on a few of these fields.
plt.figure(figsize=(12,12))
sns.heatmap(df_genres.corr(), vmax=.8, square=True, annot=True, fmt=".1f", cmap='Purples')
plt.title("Correlation on all genres");
- Even though we were trying to identify what correlated high to imdb score, we can quickly see based on the heat map above, that Reality-Tv / Game Show is high, and Family Animation is also high, related to each other.
# -- Pulled from the canvas forum. :)
# Movie_Facebook_Likes has quite a few scores, so let's break these down into BINS for easier plotting.
# create a new df
df_likes = df_full.copy()  # copy, so the new column doesn't write back to df_full
# setting my own values for bins; use df_likes's own column so the indexes line up
df_likes['movie_fb_likes_bins'] = pd.cut(df_likes['movie_facebook_likes'], bins=[0, 10000, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000, 225000, 250000, 275000, 300000, 325000, 350000], labels=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
df_likes.head()
# Now let's pull the movie_facebook_likes out and see what they correlate to.
df_likes.groupby('movie_fb_likes_bins').mean()
- From the mean() testing above, we can summarize the following.
- The higher the movie facebook likes, the higher the number of critic reviews
- Bin number 8 of movie facebook likes, is highly correlated to director facebook likes
- Bin number 8 is also highly correlated to actor 3 facebook likes.
- Interestingly though, gross is more correlated to bin 3 on movie facebook likes.
# Let's look at this from a heatmap perspective
plt.figure(figsize=(12,12))
sns.heatmap(df_likes.corr(), vmax=.8, square=True, annot=True, fmt=".1f", cmap='Greens')
plt.title("Correlation on movie_facebook_likes");
- We see a few things from this heat map.
- Movie Facebook_Likes is highly correlated to number of critic for review, similar to what we saw above.
- What really stands out is the actor 1 facebook likes, are highly attributed to the cast total facebook likes.
- Number of voted users also is related heavily to number of users for review.
# Credit: https://colab.research.google.com/drive/1nSqg5YEr1Hd8uK0utqasqcav06Hgm2ev#scrollTo=u_9cRmtVZ6ba
# Let's look at the imdb score along side num_voted_users.
# Let's show the breakdown by title_year, and display the budget for year on hover.
# The size of the bubble will be the movies aspect_ratio.
px.scatter(df, x="imdb_score", y="num_voted_users", color="title_year", hover_name='budget', size='aspect_ratio')
- This graph appears to show that a majority of the movies fall in the 2007-2015 range, and the aspect ratios are about the same across the board. The higher the imdb_score, the higher the number of voters.
# Credit: https://colab.research.google.com/drive/1nSqg5YEr1Hd8uK0utqasqcav06Hgm2ev#scrollTo=u_9cRmtVZ6ba
# Let's look at the imdb score along side duration.
# Let's show the breakdown by title_year, and display the budget for each year on hover.
# The size of the bubble will be how much the movie grossed
px.scatter(df, x="imdb_score", y="duration", color="title_year", hover_name='budget', size='gross')
- This graph appears to show the imdb_scores are generally in the range of 6-8 with a fairly low duration. This then breaks down the data by title year, and allows you to see the budget and gross on hover.
# Credit: https://colab.research.google.com/drive/1nSqg5YEr1Hd8uK0utqasqcav06Hgm2ev#scrollTo=u_9cRmtVZ6ba
px.scatter(df, x="imdb_score", y="duration", color="title_year", text='director_name')
- Looking at this plotly chart, it clearly shows that a majority of the movies range in the 6-8 imdb_score, and that their movie durations are fairly low. It is a rare instance where the movie duration is really long and it still receives a high imdb_score. See Ron Maxwell, Taylor Hackford, etc.
px.scatter(df, x="imdb_score", y="title_year", trendline='ols')
- Even though the ols trend line is going down, this isn't necessarily a bad thing. It tells me that imdb_scores overall lean toward the higher end, and the downward slope suggests the older movies tend to score slightly higher. It makes me wonder how much social media plays into this factor.
Analysis:
Story telling: Overall suggestions and implications
#assigning columns to X and Y variables
X = df['movie_facebook_likes']
y = df['imdb_score']
# We create the model and call it lr.
model1 = lm.LinearRegression()
# We train the model on our training dataset.
model1.fit(X.values.reshape(-1, 1), y)  # X needs to be 2d for LinearRegression, so reshape the Series
# Now, we predict points with our trained model.
model1_y = model1.predict(X.values.reshape(-1, 1))
# The coefficients
print('Coefficients: ', model1.coef_)
# y-intercept
print("y-intercept ", model1.intercept_)
Linear Regression Model: y = 1.4x + 6.3
A one unit increase in movie_facebook_likes changes imdb_score by the coefficient (the 1.4 term, as printed above); 6.3 is the y-intercept, not the per-unit effect.
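To make the coefficient-vs-intercept reading concrete, here is a minimal sketch on synthetic data built from a known slope and intercept (the 1.4 and 6.3 are illustrative, not the movie data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 1.4x + 6.3
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 1.4 * x.ravel() + 6.3
model = LinearRegression().fit(x, y)
# coef_ is the change in y per one-unit increase in x;
# intercept_ is the predicted y when x = 0
print(round(model.coef_[0], 1), round(model.intercept_, 1))  # 1.4 6.3
```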
# try to evaluate the performance of our model's prediction using visualization
plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
# Let's build 2nd model
X = df['duration']
y = df['imdb_score']
model2 = lm.LinearRegression()
model2.fit(X.values.reshape(-1, 1), y)
model2_y = model2.predict(X.values.reshape(-1, 1))
print('Coefficients: ', model2.coef_)
print("y-intercept ", model2.intercept_)
# try to evaluate the performance of our model's prediction using visualization
plt.subplots()
plt.scatter(y, model2_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
# Choose a different variable as X and develop another (3rd) linear regression model (model3).
X = df['director_facebook_likes']
y = df['imdb_score']
model3 = lm.LinearRegression()
model3.fit(X.values.reshape(-1, 1), y)
model3_y = model3.predict(X.values.reshape(-1, 1))  # predict with model3, not model2
print('Coefficients: ', model3.coef_)
print("y-intercept ", model3.intercept_)
# try to evaluate the performance of our model's prediction using visualization
plt.subplots()
plt.scatter(y, model3_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
print("%.10f" % model3.coef_[0])
# evaluate your model (1st)
print("mean square error: ", mean_squared_error(y, model1_y))
print(explained_variance_score(y, model1_y))
# evaluate your model (2nd)
print("mean square error: ", mean_squared_error(y, model2_y))
print(explained_variance_score(y, model2_y))
# evaluate your model (3rd)
print("mean square error: ", mean_squared_error(y, model3_y))
print(explained_variance_score(y, model3_y))
# lm = scikit-learn
#assigning columns to X and Y variables
# Let's use the highest positive correlating value as seen above in the corr() section.
X = df['director_facebook_likes']
y = df['imdb_score']
# First, we create the model and call it lm.
model1 = lm.LinearRegression()
# Second, we train the model on our training dataset.
model1.fit(X.values.reshape(-1, 1), y)  # X needs to be 2d for LinearRegression, so reshape the Series
# Now, we predict points with our trained model.
model1_y = model1.predict(X.values.reshape(-1, 1))
# The coefficients
print('Coefficients: ', model1.coef_)
# y-intercept
print("y-intercept ", model1.intercept_)
Linear Regression Model: y = 6.68x + 6.39 (using the coefficient and intercept printed above)
A one unit increase in director_facebook_likes changes imdb_score by the coefficient; 6.39 is the y-intercept (the baseline score), not the per-unit effect.
# Let's evaluate the performance of our model's prediction using visualization
plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
#dotted line represents perfect prediction (actual = predicted)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title("Actual vs predicted")
plt.show()
# Let's see what model1 returns
model1
# print the coefficients and the y-intercept
print('Coefficients: ', model1.coef_)
print("y-intercept ", model1.intercept_)
# Did we get a low MSE?
print("mean square error: ", mean_squared_error(y, model1_y))
print("variance or r-squared: ", explained_variance_score(y, model1_y))
Now we run the mean square error and the explained variance (R-squared). We have a low MSE but also a low R-squared, so this isn't a good regression model.
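One likely reason the scores look odd is that the models above are scored on the same rows they were fit on. A minimal sketch of scoring on a held-out split instead, using synthetic data (all numbers illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a known linear signal plus noise
rng = np.random.RandomState(42)
x = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * x.ravel() + rng.normal(0, 1, size=200)

# Fit on 70% of the rows, score on the held-out 30%
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))
print("test explained variance:", explained_variance_score(y_test, pred))
```

Scoring on unseen rows gives a more honest read on whether the model generalizes.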
# First we need to define y and X.
y1 = df['imdb_score']
## Instead of choosing all of the other columns, we are going to drop "y" and the text columns for X.
X1 = df.drop(['imdb_score','director_name','actor_2_name','actor_1_name','actor_3_name','movie_title','plot_keywords','movie_imdb_link','country','content_rating'], axis =1)
# Verify which columns ended up in X. Return one row.
X1.head(1)
# Time to fit the model with Lasso()
model2 = lm.Lasso(alpha=0.1) # higher alpha (penalty parameter), fewer predictors
model2.fit(X1, y1)
model2_y = model2.predict(X1)
# Let's see what model2 returns
model2
# print the coefficients and the y-intercept
print('Coefficients: ', model2.coef_)
print("y-intercept ", model2.intercept_)
# Did we get a low MSE? Compare against y1, the target we defined for this model.
print("mean square error: ", mean_squared_error(y1, model2_y))
print("variance or r-squared: ", explained_variance_score(y1, model2_y))
The Lasso model's explained variance is really high, closer to 1 than 0, but since we scored it on the same data we trained on, this is more likely overfitting than a genuinely good model.
Let's start by defining what each imdb_score represents categorically.
- Create the column by “binning” the imdb_score into 4 categories (or buckets): “less than 4, 4-6, 6-8 and 8-10, which represents bad, OK, good and excellent respectively”
dfclass['imdb_category'] = pd.cut(dfclass['imdb_score'], bins=[0, 4, 6, 8, 10], labels=[4,6,8,10])
dfclass.head()
# Comparing values to Movie Facebook Likes, see if they have any factor on IMDB Score.
# pivot table using both IMDB Category and Movie Facebook Likes
dfclass.groupby(['movie_facebook_likes', 'imdb_category']).size().sort_values(ascending=False).plot(figsize=(10,5))
plt.xlabel('Movie Facebook Likes, IMDB Category')
plt.ylabel('Count')
plt.title("IMDB Category grouped by Movie Facebook Likes");
# Comparing values to Duration, see if they have any factor on IMDB Score.
# pivot table using both IMDB Category and Duration
dfclass.groupby(['duration', 'imdb_category']).size().sort_values(ascending=False).plot(figsize=(10,5))
plt.xlabel('Duration, IMDB Category')
plt.ylabel('Count')
plt.title("IMDB Category grouped by Duration");
# Let's see what columns are objects, based on text in the results.
# We will need to remove these object columns in the next step.
dfclass.head().T
# For classification to work, we need only numeric columns.
# Drop all object columns
dfint = dfclass.drop(['gross','genres','budget','color','imdb_score','director_name','actor_2_name','actor_1_name','actor_3_name','movie_title','plot_keywords','movie_imdb_link','country','content_rating','language'], axis = 1)
dfint.head().T
# Let's look at the break down by imdb_category
# 4 = bad
# 6 = ok
# 8 = good
# 10 = excellent
dfclass.groupby('imdb_category').size()
# Before we build our models and then declare them, let's set X and Y
y = dfint['imdb_category']
X = dfint.drop(['imdb_category'], axis = 1) # put everything else into X
print(y.shape, X.shape)
# Split validation: train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize decisiontreeclassifier()
dt = DecisionTreeClassifier()
# Train the model
dt = dt.fit(X_train, y_train)
dt
#Model evaluation
print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, dt.predict(X_test)))
# print("--------------------------------------------------------")
# print(metrics.roc_auc_score(y_test, dt.predict(X_test)))
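Reading the confusion matrix takes some care; a minimal sketch with made-up true/predicted imdb_category labels shows how rows (actual) and columns (predicted) line up:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual vs predicted imdb_category labels
y_true = [4, 6, 8, 8, 10, 8]
y_pred = [4, 8, 8, 8, 10, 6]
cm = confusion_matrix(y_true, y_pred, labels=[4, 6, 8, 10])
print(cm)                              # row = actual class, column = predicted class
print(accuracy_score(y_true, y_pred))  # 4 of 6 correct
```

The diagonal holds the correct predictions; off-diagonal cells show which classes get confused with which.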
# Visualize decision tree
from graphviz import Source
from sklearn import tree
Source( tree.export_graphviz(dt, out_file=None, feature_names=X.columns))
# visualizing the new decision tree (2nd option)
from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn versions
import pydotplus
dot_data = StringIO()
tree.export_graphviz(dt, out_file=dot_data, feature_names=X.columns,
filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("data/dt.pdf")
# evaluate the model by splitting into train and test sets & develop knn model (name it as knn)
# split validation - validate your model before you run your model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# random_state=42 seeds the split so we get the same rows in each set every time.
# Initialize KNeighborsClassifier() ... name your decision model "knn"
knn = KNeighborsClassifier() # default = 5 ... see below
# Train the knn model
# knn # empty model, we need to train the algorithm using fit
knn = knn.fit(X_train, y_train)
knn
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.confusion_matrix(y_test, knn.predict(X_test)))
print("--------------------------------------------------------")
print(metrics.classification_report(y_test, knn.predict(X_test)))
# print("--------------------------------------------------------")
# print(metrics.roc_auc_score(y_test, knn.predict(X_test)))
# evaluate the knn model using 10-fold cross-validation
scores = cross_val_score(KNeighborsClassifier(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)  # note: the iid parameter was removed in newer scikit-learn
#fit model to training data
knn_gs.fit(X_train, y_train)
#save best model
knn_best = knn_gs.best_estimator_
#check best n_neighbors value
print(knn_gs.best_score_)
print(knn_gs.best_params_)
print(knn_gs.best_estimator_)
Based on the data above, KNN is the best but not by much.
# Before we can cluster we need a clean set of data without objects.
# Drop all object columns
dfcluster = dfclass.drop(['imdb_category','gross','genres','budget','color','director_name','actor_2_name','actor_1_name','actor_3_name','movie_title','plot_keywords','movie_imdb_link','country','content_rating','language'], axis = 1)
dfcluster.head().T
# variance test
dfcluster.var()
# normalize the data!
df_norm = (dfcluster - dfcluster.mean()) / (dfcluster.max() - dfcluster.min())
df_norm.head()
# variance test after normalization
df_norm.var()
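The normalization formula above (subtract the mean, divide by the range) can be checked on a tiny made-up frame:

```python
import pandas as pd

# Made-up columns on very different scales
toy = pd.DataFrame({"duration": [90.0, 110.0, 130.0],
                    "budget": [1e6, 5e7, 2e8]})
# Mean normalization: subtract the mean, divide by the range (same formula as above)
norm = (toy - toy.mean()) / (toy.max() - toy.min())
print(norm["duration"].tolist())  # [-0.5, 0.0, 0.5]
```

After this scaling, both columns sit in comparable ranges, so neither dominates the distance calculations in clustering.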
#two clusters
k_means = KMeans(init='k-means++', n_clusters=2, random_state=0)
k_means.fit(df_norm)
# clustering analysis with k = 2
#clustering results
k_means.labels_
# find out cluster centers
k_means.cluster_centers_
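We picked k = 2 here; one common way to sanity-check k is the elbow method, watching inertia fall as k grows. A minimal sketch on synthetic two-blob data (not the movie data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs
rng = np.random.RandomState(0)
pts = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                 rng.normal(5, 0.3, size=(50, 2))])
# Inertia (within-cluster sum of squares) for k = 1, 2, 3
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts).inertia_
            for k in (1, 2, 3)]
print(inertias)  # big drop from k=1 to k=2, then only a small drop
```

The "elbow" is where extra clusters stop paying off; here that is clearly k = 2.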
# convert cluster labels to dataframe
df1 = pd.DataFrame(k_means.labels_, columns = ['cluster'], index = df_norm.index)  # keep df_norm's index so the join below lines up
df1.head()
# Look at the cluster breakdown
df1.groupby('cluster').size()
# join df_norm & df1
df2 = df_norm.join(df1)
df2.head()
# What are the profiles for each cluster?
df2.groupby(['cluster']).mean()
# Re-normalize the original cluster data (same mean/range scaling as before) for hierarchical clustering
X = (dfcluster - dfcluster.mean()) / (dfcluster.max() - dfcluster.min())
X.head()
np.random.seed(1) # setting random seed to get the same results each time.
agg= AgglomerativeClustering(n_clusters=4, linkage='ward').fit(X)
agg.labels_
plt.figure(figsize=(16,8))
linkage_matrix = ward(X)
dendrogram(linkage_matrix, orientation="left")
plt.tight_layout() # fixes margins
plt.figure(figsize=(16,8))
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
linkage_matrix = ward(X)
dendrogram(linkage_matrix,
#truncate_mode='lastp', # show only the last p merged clusters
#p=12, # show only the last p merged clusters
#show_leaf_counts=False, # otherwise numbers in brackets are counts
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True, # to get a distribution impression in truncated branches
orientation="top")
plt.tight_layout() # fixes margins
plt.figure(figsize=(16,8))
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
linkage_matrix = ward(X)
dendrogram(linkage_matrix,
truncate_mode='lastp', # show only the last p merged clusters
p=4, # show only the last p merged clusters
#show_leaf_counts=False, # otherwise numbers in brackets are counts
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True, # to get a distribution impression in truncated branches
orientation="top")
plt.tight_layout() # fixes margins
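The dendrograms visualize the ward linkage matrix; to turn that same matrix into flat cluster labels, scipy's `fcluster` can cut the tree. A minimal sketch on four made-up points:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward

# Four made-up points forming two obvious pairs
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
linkage_matrix = ward(pts)
# Cut the tree into 2 flat clusters
labels = fcluster(linkage_matrix, t=2, criterion="maxclust")
print(labels)
```

This is an alternative to AgglomerativeClustering when you already have the linkage matrix from the dendrogram step.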
#To add cluster label into the dataset as a column
df1 = pd.DataFrame(agg.labels_, columns = ['cluster'], index = X.index)  # match X's index so the joins below align
df1.head()
df2 = df.join(df1)
df2.head()
# What are the profiles for the df2 non normal data?
df2.groupby('cluster').mean()
# Look at the cluster breakdown
df2.groupby('cluster').size()
sns.lmplot(x="cluster", y="movie_facebook_likes", data=df2, x_jitter=.15, y_jitter=.15)
sns.lmplot(x="cluster", y="duration", data=df2, x_jitter=.15, y_jitter=.15)
sns.lmplot(x="cluster", y="num_voted_users", data=df2, x_jitter=.15, y_jitter=.15)
# join df & df1
df3 = dfcluster.join(df1)
df3.head()
# Look at the cluster breakdown
df3.groupby('cluster').size()
# Look at the profiles for each cluster.
df3.groupby('cluster').mean()
df3.groupby('cluster')['director_facebook_likes'].mean()
sns.lmplot(x="cluster", y="director_facebook_likes", data=df3, x_jitter=.15, y_jitter=.15);
We can see that clusters 0, 1 and 3 have the higher director_facebook_likes.
- What did we learn in this project?
Our category breakdown was interesting.
- imdb_category
- 4 (bad) 95
- 6 (OK) 1055
- 8 (good) 2467
- 10 (excellent) 158
Based on the data above, the best model is "KNN"